1 Bicycle Transportation

Using a bike instead of a car for short trips reduces the carbon footprint of that trip by about 75% [source]. Shared public systems help increase the use of bicycles for transportation. The United States Department of Transportation Bureau of Transportation Statistics (BTS) tracks bikeshare (docked and dockless) and E-scooter systems, and presents a summary of these micromobility systems at this site.

Let’s focus on bikeshare systems and load in data on them from BTS:

bikes_scooter<-read_csv("http://faculty.olin.edu/dshuman/DS/Bikeshare_Docked_and_Dockless_and_E-scooter_Systems_by_Year_and_City_Served.csv")%>%
    janitor::clean_names()

bikeshare<-bikes_scooter%>%
  select(city,state,year,dock_ct)%>%
  filter(dock_ct>0)%>%
  arrange(city,state,year)
Table 1.1: A few random rows of the bikeshare table. Each row is a city-state-year triplet. The dock_ct variable gives the number of docked stations in the city.
city state year dock_ct
Long Beach CA 2021 86
Blacksburg VA 2019 12
Columbia MD 2018 9
Los Angeles CA 2017 119
Harrisburg PA 2019 11
Fort Lauderdale FL 2020 21

We also have a second table from Wikipedia that has the population and land area for the 250 most populous cities in the United States, with all data from the 2020 Census:

top_250_cities_by_pop<-read_csv("http://faculty.olin.edu/dshuman/DS/us_top_250_cities_by_population.csv")%>%
  janitor::clean_names()%>%
  select(city,state,pop_2020,area_sq_miles)
Table 1.2: A few random rows of the top_250_cities_by_pop table. Each row is a city-state pair, and the table includes the 250 most populous cities, according to the 2020 U.S. Census.
city state pop_2020 area_sq_miles
Lincoln NE 291082 97.7
Lexington KY 322570 283.6
Pasadena TX 151950 43.7
Boise ID 235684 84.0
Victorville CA 134810 73.7
Roseville CA 147773 44.1


2 Joining Two Data Frames

A join is a data verb that combines two tables.

There are several kinds of join.

2.1 Establishing a match between cases

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

  • You specify which pairs to use.
  • A pair is a variable from the left table and a variable from the right table.
  • Cases must have exactly equal values in the left variable and right variable for a match to be made.

As an example, we’ll examine different ways to combine the bikeshare and top_250_cities_by_pop tables.

The very first question to ask is what variables the two tables have in common. In this case, it is the city and the state.
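As a quick illustration of the matching rule, here is a minimal sketch with made-up tibbles (hypothetical values, not drawn from either of our tables): the two Portland rows only match when both key variables agree.

```r
# Minimal sketch with made-up data: a match requires exact equality in
# every specified pair of key variables.
library(dplyr)

left_tbl  <- tibble(city  = c("Portland", "Portland"),
                    state = c("OR", "ME"),
                    dock_ct = c(20, 5))
right_tbl <- tibble(city  = "Portland",
                    state = "OR",
                    pop   = 652503)

left_join(left_tbl, right_tbl, by = c("city", "state"))
# Portland, OR matches and receives pop; Portland, ME has no match (pop is NA).
```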

2.2 Mutating joins

The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.1

The three mutating join options in this class differ mainly in how they answer the following questions:

  1. What happens when a case in the right table has no matches in the left table?
  2. What happens when a case in the left table has no matches in the right table?

Three mutating join functions:

  • left_join(): the output has all cases from the left, regardless of whether there is a match in the right, but discards any cases in the right that do not have a match in the left.
  • inner_join(): the output has only the cases from the left with a match in the right.
  • full_join(): the output has all cases from the left and the right. This is less common than the first two join operators.

When there are multiple matches in the right table for a particular case in the left table, all three of these mutating join operators produce a separate case in the new table for each of the matches from the right.
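Here is a minimal sketch of that duplication behavior with made-up tibbles (hypothetical values, not from our tables): one left row, two matching right rows.

```r
# Minimal sketch with made-up data: when a left case matches several
# right cases, each match becomes its own row in the output.
library(dplyr)

left_tbl  <- tibble(city = "Chicago", state = "IL")
right_tbl <- tibble(city  = c("Chicago", "Chicago"),
                    state = c("IL", "IL"),
                    year  = c(2018, 2019))

left_join(left_tbl, right_tbl, by = c("city", "state"))
# The single Chicago row from the left appears twice in the output,
# once for each matching right row.
```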

One of the most common and useful mutating joins is one that translates levels of a variable to a new scale; e.g., a join that translates letter grades (e.g., “B”) into grade points (e.g., 3) for a GPA calculation.
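For instance, a translation join for a GPA might look like the following sketch (the grades and grade_points tables here are hypothetical):

```r
# Sketch of a translation join with hypothetical data: each letter grade
# is looked up in the grade_points table, then averaged into a GPA.
library(dplyr)

grades <- tibble(course = c("Data Science", "Calculus", "Biology"),
                 grade  = c("A", "B", "B"))
grade_points <- tibble(grade  = c("A", "B", "C", "D", "F"),
                       points = c(4, 3, 2, 1, 0))

grades %>%
  left_join(grade_points, by = "grade") %>%
  summarize(gpa = mean(points))
# gpa = (4 + 3 + 3) / 3
```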

Example 2.1 (left_join) Let’s mutate two new columns to the bikeshare table by pulling in the population and land area from the top_250_cities_by_pop table:

bikeshare2<-bikeshare%>%
  left_join(top_250_cities_by_pop,by=c("state"="state","city"="city"))
Table 2.1: Two additional columns added to the bikeshare table, pulling in data from the top_250_cities_by_pop table when the city is among the 250 most populous in the United States.
city state year dock_ct pop_2020 area_sq_miles
Long Beach CA 2021 86 466742 50.7
Blacksburg VA 2019 12 NA NA
Columbia MD 2018 9 NA NA
Los Angeles CA 2017 119 3898747 469.5
Harrisburg PA 2019 11 NA NA
Fort Lauderdale FL 2020 21 182760 34.6

A few notes:

  1. We need to match both city and state, to distinguish for example between Rochester, NY and Rochester, MN.

  2. In this case, the variable names we are matching happen to be the same, but they don’t need to be. If the right table’s state variable were named ST, we’d just change that part of the match to "state"="ST".

  3. If the by= argument is omitted from a join, then R performs a natural join, which matches the two tables on all variables they have in common. In this case, that would yield the same result (but you should be careful in general):

bikeshare2<-bikeshare%>%
  left_join(top_250_cities_by_pop)
## Joining with `by = join_by(city, state)`

Example 2.2 (inner_join) The bikeshare2 table has a lot of rows corresponding to cities that are not among the 250 most populous, and thus have NAs in the last two columns. If we want to create a table that only has information for cities that are in both tables, we can instead use inner_join:

bikeshare3<-bikeshare%>%
  inner_join(top_250_cities_by_pop,by=c("state"="state","city"="city"))
Table 2.2: The inner_join only keeps cities that appear in both the bikeshare and top_250_cities_by_pop tables.
city state year dock_ct pop_2020 area_sq_miles
Salt Lake City UT 2018 34 199723 110.3
Chicago IL 2019 608 2746388 227.7
Honolulu HI 2018 136 350964 60.5
San Antonio TX 2021 55 1434625 498.8
New York NY 2021 1566 8804190 300.5
Milwaukee WI 2017 86 577222 96.2

2.3 Filtering joins

The second class of joins are filtering joins, which select specific cases from the left table based on whether they match an observation in the right table.

  • semi_join(): discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.
  • anti_join(): discards any cases in the left table that have a match in the right table.
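The contrast between the two can be seen in a minimal sketch with made-up tibbles (hypothetical values, not from our tables):

```r
# Minimal sketch with made-up data: semi_join keeps matching left rows
# (one copy each, no new columns); anti_join keeps non-matching left rows.
library(dplyr)

left_tbl  <- tibble(city = c("Chicago", "Boise"), dock_ct = c(608, 12))
right_tbl <- tibble(city = c("Chicago", "Chicago"), year = c(2018, 2019))

semi_join(left_tbl, right_tbl, by = "city")  # one Chicago row; no year column
anti_join(left_tbl, right_tbl, by = "city")  # just the Boise row
```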

A particularly common use of these joins is to compare against a filtered summary in order to select a subset of the original cases, as follows.

Example 2.3 (semi_join to compare to a filtered summary) Compute a subset of the bikeshare table that only contains rows from the top three most populous cities in the United States.

Solution. This is where we can use semi_join to filter the rows of bikeshare:

top_three_populus<-top_250_cities_by_pop%>%
  arrange(desc(pop_2020))%>%
  head(n=3)
bikeshare_in_populus_cities<-bikeshare%>%
  semi_join(top_three_populus)
Table 2.3: This semi_join only keeps rows for cities that are in the top three most populous.
city state year dock_ct
Chicago IL 2015 469
Chicago IL 2016 582
Chicago IL 2017 574
Chicago IL 2018 608
Chicago IL 2019 608
Chicago IL 2020 612
Chicago IL 2021 671
Chicago IL 2022 1367
Chicago IL 2023 1636
Los Angeles CA 2016 104
Los Angeles CA 2017 119
Los Angeles CA 2018 127
Los Angeles CA 2019 127
Los Angeles CA 2020 213
Los Angeles CA 2021 218
Los Angeles CA 2022 219
Los Angeles CA 2023 223
New York NY 2015 507
New York NY 2016 664
New York NY 2017 811
New York NY 2018 841
New York NY 2019 841
New York NY 2020 1011
New York NY 2021 1566
New York NY 2022 1703
New York NY 2023 1954

Example 2.4 (anti_join) Which U.S. cities among the 250 most populous have no bikeshare program listed in the BTS data?

Solution. While semi_join keeps rows of the left table that have a match in the right table, anti_join keeps rows that do not have a match:

populus_cities_no_bike_scooter<-top_250_cities_by_pop%>%
  anti_join(bikeshare)%>%
  arrange(desc(pop_2020))
Table 2.4: The anti_join only keeps cities in the left table (top_250_cities_by_pop) that do not appear in the right table (bikeshare). These are the six most populous of the 172 cities that fall into this category.
city state pop_2020 area_sq_miles
San Jose CA 1013240 178.3
Jacksonville FL 949611 747.3
Fresno CA 542107 115.2
Sacramento CA 524943 98.6
Mesa AZ 504258 138.7
Colorado Springs CO 478961 195.4

2.4 More join practice

Here is an additional table from Wikipedia with the top 150 United States cities by land area:

top_150_cities_by_area<-read_csv("http://faculty.olin.edu/dshuman/DS/us_top_150_cities_by_land_area.csv")%>%
  janitor::clean_names()%>%
  arrange(desc(land_area_mi2))
Table 2.5: The top ten United States cities by land area.
city st land_area_mi2 land_area_km2 water_area_mi2 water_area_km2 total_area_mi2 total_area_km2 population_2020
Sitka AK 2870.1 7434 1945.1 5038.0 4815.1 12471 8458
Juneau AK 2704.2 7004 550.7 1426.0 3254.9 8430 32255
Wrangell AK 2556.1 6620 920.6 2384.0 3476.7 9005 2127
Anchorage AK 1707.0 4421 239.7 621.0 1946.7 5042 291247
Tribune KS 778.2 2016 0.0 0.0 778.2 2016 1182
Jacksonville FL 747.3 1935 127.2 329.0 874.5 2265 949611
Anaconda MT 736.7 1908 4.7 12.0 741.4 1920 9421
Butte MT 715.8 1854 0.6 1.6 716.3 1855 34494
Houston TX 640.6 1659 31.2 81.0 671.8 1740 2304580
Oklahoma City OK 606.5 1571 14.3 37.0 620.8 1608 681054

Exercise 2.1 Use your wrangling skills to answer the following questions. Hint: start by thinking about what tables you might need to join (if any), determining which is the left table and which is the right table, and identifying the corresponding variables to match.

  1. Which of the 250 most populous cities in the United States had the most shared bicycle docking stations per square mile in 2022?

  2. Which of the 20 most populous cities in the United States has the most land area per 100,000 inhabitants?

  3. Find a list of all of the 20 largest U.S. cities by land area that had a bikeshare program listed in the BTS data at any time between 2015 and 2023.

  4. How many U.S. cities are in the top 150 by population, but not in the top 150 by land area? Hint: It is an even number!

bikeshare3 %>% 
  mutate(stations_per_sq_mile = dock_ct/area_sq_miles) %>% 
  filter(year==2022) %>% 
  arrange(desc(stations_per_sq_mile))
## # A tibble: 49 × 7
##    city          state  year dock_ct pop_2020 area_sq_miles stations_per_sq_mile
##    <chr>         <chr> <dbl>   <dbl>    <dbl>         <dbl>                <dbl>
##  1 Washington    DC     2022     684   689545          61.1                11.2 
##  2 San Francisco CA     2022     494   873965          46.9                10.5 
##  3 Boston        MA     2022     434   675647          48.3                 8.99
##  4 Oakland       CA     2022     494   440646          55.9                 8.84
##  5 Minneapolis   MN     2022     399   429954          54                   7.39
##  6 Chicago       IL     2022    1367  2746388         228.                  6.00
##  7 New York      NY     2022    1703  8804190         300.                  5.67
##  8 Miami         FL     2022     160   442241          36                   4.44
##  9 Buffalo       NY     2022     110   278349          40.4                 2.72
## 10 Honolulu      HI     2022     134   350964          60.5                 2.21
## # ℹ 39 more rows
head(top_250_cities_by_pop, n=20L) %>% 
  mutate(area_per_100k = area_sq_miles/(pop_2020/100000)) %>% 
  arrange(desc(area_per_100k))
## # A tibble: 20 × 5
##    city          state pop_2020 area_sq_miles area_per_100k
##    <chr>         <chr>    <dbl>         <dbl>         <dbl>
##  1 Jacksonville  FL      949611         747.          78.7 
##  2 Indianapolis  IN      887642         362.          40.7 
##  3 Fort Worth    TX      918915         347.          37.8 
##  4 Charlotte     NC      874579         308.          35.3 
##  5 San Antonio   TX     1434625         499.          34.8 
##  6 Austin        TX      961855         320.          33.3 
##  7 Phoenix       AZ     1608139         518           32.2 
##  8 Houston       TX     2304580         640.          27.8 
##  9 Dallas        TX     1304379         340.          26.0 
## 10 Columbus      OH      905748         220           24.3 
## 11 San Diego     CA     1386932         326.          23.5 
## 12 Denver        CO      715522         153.          21.4 
## 13 San Jose      CA     1013240         178.          17.6 
## 14 Los Angeles   CA     3898747         470.          12.0 
## 15 Seattle       WA      737015          83.8         11.4 
## 16 Washington    DC      689545          61.1          8.86
## 17 Philadelphia  PA     1603797         134.           8.38
## 18 Chicago       IL     2746388         228.           8.29
## 19 San Francisco CA      873965          46.9          5.37
## 20 New York      NY     8804190         300.           3.41
top_150_cities_by_area %>% 
  semi_join(bikeshare, by="city") %>% 
  arrange(desc(land_area_mi2)) %>% 
  head(n=20L)
## # A tibble: 20 × 9
##    city          st    land_area_mi2 land_area_km2 water_area_mi2 water_area_km2
##    <chr>         <chr>         <dbl>         <dbl>          <dbl>          <dbl>
##  1 Houston       TX             641.          1659           31.2          81   
##  2 Oklahoma City OK             606.          1571           14.3          37   
##  3 Phoenix       AZ             518.          1342            1             2.6 
##  4 San Antonio   TX             499.          1292            5.8          15   
##  5 Nashville     TN             476.          1232           21.6          56   
##  6 Los Angeles   CA             470.          1219           31.1          81   
##  7 Indianapolis  IN             361            935            6.9          18   
##  8 Fort Worth    TX             350.           907            8.3          21   
##  9 Dallas        TX             340.           880           44           114   
## 10 Austin        TX             326.           845            6.6          17   
## 11 San Diego     CA             326.           845           46.3         120   
## 12 Louisville    KY             324.           840           16.5          43   
## 13 Kansas City   MO             315.           815            4.1          11   
## 14 Charlotte     NC             311.           805            2             5.2 
## 15 Memphis       TN             296.           767            7.6          20   
## 16 El Paso       TX             259.           670            0.6           1.6 
## 17 Tucson        AZ             242.           627            0.3           0.78
## 18 Chicago       IL             228.           590            6.8          18   
## 19 Huntsville    AL             224.           579            1.5           3.9 
## 20 Columbus      OH             221.           572            5.9          15   
## # ℹ 3 more variables: total_area_mi2 <dbl>, total_area_km2 <dbl>,
## #   population_2020 <dbl>
top_250_cities_by_pop %>% 
  arrange(desc(pop_2020)) %>% 
  head(n=150L) %>% 
  anti_join(top_150_cities_by_area, by = c("city"="city", "state"="st")) %>% 
  filter(city!="New York")
## # A tibble: 63 × 4
##    city          state pop_2020 area_sq_miles
##    <chr>         <chr>    <dbl>         <dbl>
##  1 San Francisco CA      873965          46.9
##  2 Washington    DC      689545          61.1
##  3 Boston        MA      675647          48.3
##  4 Long Beach    CA      466742          50.7
##  5 Miami         FL      442241          36  
##  6 Oakland       CA      440646          55.9
##  7 Minneapolis   MN      429954          54  
##  8 Cleveland     OH      372624          77.7
##  9 Honolulu      HI      350964          60.5
## 10 Anaheim       CA      346824          50.3
## # ℹ 53 more rows


3 Bicycle-Use Patterns

In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.

Figure 3.1: A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.

Figure 3.2: One of the vans used to redistribute bicycles to different stations.

Two data tables are available: Trips and Stations.

Here is the code to read in the data:2

Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data-Small.csv")
#Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data.csv")
Stations<-read_csv("http://faculty.olin.edu/dshuman/DS/DC-Stations.csv")

The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the file name.

3.1 Warm-up: Temporal patterns

It’s natural to expect that bikes are rented more at some times of day, on some days of the week, and in some months of the year than others. The variable sdate gives the time (including the date) at which the rental started.

Exercise 3.1 (Single variable temporal plots) Make the following plots and interpret them:

  1. A density plot of the events versus sdate. Use ggplot() and geom_density().
  2. A density plot of the events versus time of day. You can use mutate() with lubridate::hour() and lubridate::minute() to extract the hour of the day and the minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a field where 3:30 is 3.5 and 3:45 is 3.75.
  3. A bar chart of the events by day of the week.
  4. Facet your graph from (b) by day of the week. Is there a pattern?
ggplot(Trips, aes(sdate)) + 
  geom_density()

Trips <- Trips %>% 
  mutate(day_time = lubridate::hour(sdate) + lubridate::minute(sdate)/60)

ggplot(Trips, aes(day_time)) + 
  geom_density()

Trips <- Trips %>% 
  mutate(weekday = lubridate::wday(sdate, label=TRUE))

ggplot(Trips, aes(weekday)) +
  geom_bar()

ggplot(Trips, aes(day_time)) + 
  geom_density()+
  facet_wrap(~weekday)

Solution. On weekdays, there are spikes during commuting times.

The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual). Do you think these two different categories of users show different rental behavior? How might client type interact with the patterns you found in Exercise 3.1?

Exercise 3.2 (Customer segmentation) Repeat the graphic from Exercise 3.1 (d) with the following changes:

  1. Set the fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.
  2. Now add the argument position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
  3. Rather than faceting on day of the week, create a new faceting variable like this: mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday")). What does the variable wday represent? Try to understand the code.
  4. Is it better to facet on wday and fill with client, or vice versa?
  5. Of all of the graphics you created so far, which is most effective at telling an interesting story?
ggplot(Trips) + 
  geom_density(aes(day_time, fill=client),
               alpha=0.5, color=NA)+
  facet_wrap(~weekday)

ggplot(Trips) + 
  geom_density(aes(day_time, fill=client),
               alpha=0.5, color=NA,
               position = position_stack())+
  facet_wrap(~weekday)

Solution. The stacked version shows the total number of riders at a given time, but the unstacked version is better at showing the contrast between casual and registered users. I think the non-stacked graph is better at showing how registered users are the ones using the bikes during commuting times.

Trips <- Trips %>% 
  mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))

ggplot(Trips) + 
  geom_density(aes(day_time, fill=client),
               alpha=0.5, color=NA)+
  facet_wrap(~wday)

ggplot(Trips) + 
  geom_density(aes(day_time, fill=wday),
               alpha=0.5, color=NA)+
  facet_wrap(~client)

Solution.

  1. wday represents whether the day is a weekday or a weekend; the code checks whether lubridate::wday(sdate) is 1 (Sunday) or 7 (Saturday).
  2. I think it’s better to facet by wday and fill with client, because it makes more sense to have weekday/weekend as the overall header when the x-axis is time of day.
  3. I think the most recent graph is the most effective, because it shows the distinct contrast between registered and casual users on weekdays.

3.2 Mutating join practice: Spatial patterns

Exercise 3.3 (Visualization of bicycle departures by station) Use the latitude and longitude variables in Stations to make a spatial visualization of the total number of departures from each station in the Trips data. Your map can be static or interactive via leaflet. Here is a good bounding box:

dc_bbox<-c(-77.1,38.87,-76.975,38.95)
register_stadiamaps("YOUR-STADIA-MAPS-API-KEY", write = TRUE)
station_departures <- Stations %>%
  left_join(Trips, by=c("name"="sstation")) %>% 
  group_by(name) %>% 
  count() %>% 
  left_join(Stations, by=c("name"="name")) %>% 
  select(name, n, lat, long)

ggmap(get_stadiamap(bbox = dc_bbox, maptype = "stamen_toner_lite", zoom = 13)) +
  geom_point(data=station_departures, aes(x=long, y=lat, size=n), alpha=0.5) +
  labs(title="Rental Bike Departures in Washington DC", size="Departures")


Exercise 3.4 Only 14.4% of the trips in our data are carried out by casual users.3 Create a map that shows which area(s) of the city have stations with a much higher percentage of departures by casual users. Interpret your map. Hint: you may want to exclude stations with low overall traffic (e.g., under 20 total departures).

casual_departures <- Stations %>%
  left_join(Trips, by=c("name"="sstation")) %>% 
  group_by(name, client) %>% 
  count() %>%
  pivot_wider(names_from = client, values_from = n) %>% 
  mutate(casual_percentage = Casual/(Casual+Registered)) %>% 
  left_join(Stations, by=c("name"="name")) %>%
  select(name, casual_percentage, lat, long)

ggmap(get_stadiamap(bbox = dc_bbox, maptype = "stamen_toner_lite", zoom = 13)) +
  geom_point(data=casual_departures, aes(x=long, y=lat, color=casual_percentage), alpha=0.5, size=3) +
  labs(title="Casual Rider Departures Washington DC", color="Percentage of Casual Riders")

Solution. There are more casual riders in the southern part of the city, by the river. I’m guessing this is the more touristy part of the city: downtown, by the water, etc.


3.3 Filtering join practice: Spatiotemporal patterns

Exercise 3.5 (High traffic points)

  1. Make a table with the ten station-date combinations (e.g., 14th & V St., 2014-10-14) with the highest number of departures, sorted from most departures to fewest. Hint: as_date(sdate) converts sdate from date-time format to date format.
  2. Use a join operation to make a table with only those trips whose departures match those top ten station-date combinations from part (a).
  3. Group the trips you kept in part (b) by client type and wday (weekend/weekday), and count the total number of trips in each of the four groups. Interpret your results.
station_date <- Stations %>%
  left_join(Trips, by=c("name"="sstation")) %>% 
  mutate(date = as_date(sdate)) %>% 
  group_by(name, date) %>%
  count() %>% 
  arrange(desc(n)) %>% 
  head(10)
station_date
## # A tibble: 10 × 3
## # Groups:   name, date [10]
##    name                                 date           n
##    <chr>                                <date>     <int>
##  1 Columbus Circle / Union Station      2014-11-12    11
##  2 Jefferson Dr & 14th St SW            2014-12-27     9
##  3 Lincoln Memorial                     2014-10-05     9
##  4 Lincoln Memorial                     2014-10-09     8
##  5 17th St & Massachusetts Ave NW       2014-10-06     7
##  6 Columbus Circle / Union Station      2014-10-02     7
##  7 Georgetown Harbor / 30th St NW       2014-10-25     7
##  8 Massachusetts Ave & Dupont Circle NW 2014-10-01     7
##  9 New Hampshire Ave & T St NW          2014-10-16     7
## 10 14th & V St NW                       2014-11-07     6
busy_day_trips <- Trips %>% 
  mutate(date = as_date(sdate)) %>% 
  semi_join(station_date, by=c("sstation"="name", "date"="date"))
busy_day_trips %>% 
  arrange(sstation) %>% 
  select(sstation, client, weekday)
## # A tibble: 78 × 3
##    sstation                       client     weekday
##    <chr>                          <chr>      <ord>  
##  1 14th & V St NW                 Registered Fri    
##  2 14th & V St NW                 Registered Fri    
##  3 14th & V St NW                 Registered Fri    
##  4 14th & V St NW                 Registered Fri    
##  5 14th & V St NW                 Registered Fri    
##  6 14th & V St NW                 Registered Fri    
##  7 17th St & Massachusetts Ave NW Registered Mon    
##  8 17th St & Massachusetts Ave NW Registered Mon    
##  9 17th St & Massachusetts Ave NW Registered Mon    
## 10 17th St & Massachusetts Ave NW Registered Mon    
## # ℹ 68 more rows
busy_day_trips %>% 
  group_by(wday, client) %>% 
  count()
## # A tibble: 4 × 3
## # Groups:   wday, client [4]
##   wday    client         n
##   <chr>   <chr>      <int>
## 1 weekday Casual         9
## 2 weekday Registered    44
## 3 weekend Casual        20
## 4 weekend Registered     5

Solution. The busiest stations on a given day were mainly used by registered riders on weekdays and by casual riders on weekends. I suspect the large number of registered riders is due to a public transit shutdown, causing bus and subway commuters to bike instead. The higher number of casual riders on weekends is likely a large group of tourists all renting bikes from the same place together.


  1. There is also a right_join() that adds variables in the reverse direction, from the left table to the right table, but we do not really need it since we can always switch the roles of the two tables.↩︎

  2. Important: To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.↩︎

  3. We can compute this statistic via mean(Trips$client=="Casual").↩︎